arXiv:1508.01596v1 [stat.ML] 7 Aug 2015


Sublinear Partition Estimation 


Pushpendre Rastogi Benjamin Van Durme 

Johns Hopkins University 


Abstract 


The output scores of a neural network classifier are converted to probabilities via 
normalizing over the scores of all competing categories. Computing this partition 
function, Z, is then linear in the number of categories, which is problematic as 
real-world problem sets continue to grow in categorical types, such as in visual 
object recognition or discriminative language modeling. We propose three approaches
for sublinear estimation of the partition function, based on approximate
nearest neighbor search and kernel feature maps, and compare the performance of
the proposed approaches empirically.


1 Introduction 


Neural networks (and log-linear models) have outperformed other machine learning frameworks
on a number of difficult multi-class classification tasks, such as object recognition in images, large
vocabulary speech recognition, and many others [19, 10, 25]. These classification tasks become
“large scale” as the number of potential classes/categories increases. For instance, discriminative
language models need to choose the most probable word from the entire vocabulary, which can contain
more than 100,000 words in the case of English. In computer vision,
the latest datasets for object recognition contain more than 10,000 object classes, and the number of
categories is increasing every year [?].

In certain applications neural networks may be used as sub-systems of a larger model, in which case
it may be necessary to convert the unnormalized score assigned to a class by a neural network into a
probability value. To perform this conversion we need to compute the so-called partition function
of the neural network, which is the sum of the exponentiated scores that the network assigns to all the classes.
Let $N$ be the number of output classes and let $q, v_i \in \mathbb{R}^d$, where $v_i$ represents the weights of the $i$th class.
Also, let $u_i$ be the score of the $i$th class, i.e. $u_i = v_i \cdot q$; then the partition function $Z(q)$ is defined as:

\[ Z(q) = \sum_{i=1}^{N} \exp(u_i) \tag{1} \]

The problem of assigning a probability to the most probable class can be stated as:

\[ \text{Find } \hat{i} = \arg\max_i u_i \tag{2} \]

\[ p(\hat{i}) = \frac{\exp(u_{\hat{i}})}{Z(q)} \tag{3} \]
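As a point of reference for Equations (1) to (3), the brute-force computation that the rest of the paper tries to avoid can be sketched as follows. This is a minimal illustration, assuming the class weights are stacked row-wise in a NumPy array `V` of shape (N, d); the function name `exact_partition` and the toy values are hypothetical and not from the paper.

```python
import numpy as np

def exact_partition(V, q):
    """Score all N classes (O(N d) work) and return (Z, i_hat, p_hat)."""
    u = V @ q                             # u_i = v_i . q for every class
    i_hat = int(np.argmax(u))             # Eq. (2): most probable class
    Z = float(np.exp(u).sum())            # Eq. (1): partition function
    p_hat = float(np.exp(u[i_hat]) / Z)   # Eq. (3): its probability
    return Z, i_hat, p_hat

# Toy example with N = 3 classes and d = 2 features (illustrative values).
V = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
q = np.array([0.5, -0.5])
Z, i_hat, p_hat = exact_partition(V, q)
```

For large scores a log-sum-exp formulation would be numerically safer; the direct form is kept here only because it mirrors the equations.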


Recently, [?] have presented methods for solving (2) by building upon frameworks for
performing fast randomized nearest neighbor searches, such as Locality Sensitive Hashing and
randomized k-d trees. This paper presents methods for estimating the value of $Z(q)$ required in (3).
While brute-force parallelization is one effective strategy for reducing the time needed for this
computation, our goal is to estimate the partition function with asymptotically lower runtime.
2 Previous Work


A number of techniques have been used previously for speeding up the computation of the partition 
function in artificial neural networks. Computational efficiency of the normalizing step seems to be 
especially important for these tasks because of the large size of vocabulary needed to achieve state 
of the art performance in these tasks. The prior work can be categorized by the technique it uses as 
follows: 

Importance Sampling: The partition function can be written as $Z = N\,\mathbb{E}[\exp(u_i)]$, where the
distribution of $i$ is uniform over $[1, \ldots, N]$. However, attempts to estimate $Z$ by simply drawing
$k$ samples from the uniform distribution and replacing the expectation by its sample estimate are
marred by the high variance of the estimate. [?] were the first to use importance sampling for
reducing the variance of the sample average estimator. Their aim was to speed up the training
of neural language models, and they used an n-gram language model as the proposal distribution.
Though they do not use their method for actually computing the partition function at inference time,
their method could easily be extended for that purpose. The problem with their method, however,
is that it requires an external model for constructing the proposal distribution. Such an
external model requires extra engineering and knowledge, specific to the problem domain, that may
not be available.

Hierarchical decomposition: [13] introduced a method for breaking the original $N$-way decision
problem into a hierarchical one that requires only $O(\log N)$ computations. That method
requires changing the model from a single decision problem into a chain of decision problems
computed over a tree. Also, since there is no a priori preferable way of growing the tree
for this computation, an external model is needed for creating the hierarchy.

Self-Normalization: Some of the prior work side-steps the problem of computing the partition 
function and trains neural networks with the added constraint that the partition function should 
remain close to 1 for inputs seen during test time. 

[?] used Noise Contrastive Estimation (footnote 1) with a heuristic to clamp the value of $Z$ to 1 during training.
They demonstrated empirically that by doing so the partition function at test time also remained
close to 1, though they did not provide any theoretical analysis of how close to 1 the value of the
partition function remains at test time.

On the other hand, [?] added a penalty term $\log(Z(q))^2$ to the training objective of the
neural network and empirically demonstrated on a large-scale machine translation task that they do
not suffer a large loss in accuracy even if they assume that the value of the partition
function is close to 1 for all inputs at test time. Recently, [?] showed that after training on $n$
training examples, with probability $1 - \lambda$ the expected value of $\log(Z(q))$ lies in an interval of size
$4\left(\sqrt{\frac{dN\log(dBRn) + \log\frac{1}{\lambda}}{2n}} + \frac{1}{n}\right)$ centered at $\frac{1}{n}\sum_{i=1}^{n}\log(Z(q_i))$. Here $N$ and $d$ are the number of
output classes and the number of features in a log-linear model, and $B$ and $R$ are upper bounds on the infinity
norm of $v_i$ and $q$ respectively.


3 Background 


Maximum Inner Product Search (MIPS): MIPS refers to the problem of finding points that have
the highest inner product with an input query vector (footnote 2). Let us define $S_k(q)$ to be the set of $k$ vectors
that have the highest inner product with the vector $q$, and let us assume for simplicity that MIPS
algorithms allow us to retrieve $S_k(q)$ for arbitrary $k$ and $q$ in sublinear time. The exact order of
runtime depends on the dataset and the indexing algorithm chosen for retrieval. For example, one
could use the popular library FLANN [16, 15], PCA-Trees [24], or LSH itself [7] for retrieving
$S_k(q)$. Also, [9] presented a measure of hardness of the dataset for nearest neighbor algorithms.

[?] and [?] presented methods for MIPS based on Asymmetric Locality Sensitive Hashing
(LSH), i.e., they used two separate hash functions for the query and the data. A different approach,
which relies on reducing the problem of performing maximum inner product search over a set of
$d$-dimensional vectors to the problem of performing nearest neighbor search in Euclidean distance
over a set of $(d+1)$-dimensional vectors, was presented by [?]. These algorithms enable us to draw
high-probability samples from the unnormalized distribution over $i \in [1, \ldots, N]$ induced by $q$, and
we will use this property heavily for creating our estimators.

1 Please see Section 3 for a brief overview and [8] for a detailed explanation of NCE.

2 Note that simply by querying for $-q$ one can also find the vectors with the smallest inner product.
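The reduction from inner product search in $d$ dimensions to Euclidean nearest neighbor search in $d+1$ dimensions can be illustrated with one well-known construction: append $\sqrt{M^2 - \lVert v \rVert^2}$ to each data vector (with $M$ the largest data norm) and append 0 to the query. Whether this matches the cited work in every detail is an assumption; the helper names below are hypothetical, and the linear scan only stands in for a real sublinear index.

```python
import numpy as np

def augment_data(V):
    """Map each d-dim data vector v to (v, sqrt(M^2 - |v|^2)) in d+1 dims."""
    norms = np.linalg.norm(V, axis=1)
    M = norms.max()
    extra = np.sqrt(np.maximum(M**2 - norms**2, 0.0))
    return np.hstack([V, extra[:, None]])

def augment_query(q):
    """Map the query q to (q, 0) in d+1 dims."""
    return np.append(q, 0.0)

rng = np.random.default_rng(0)
V = rng.normal(size=(500, 8))   # illustrative random data, not the paper's
q = rng.normal(size=8)

# After augmentation every data vector has norm M, so
# |v' - q'|^2 = M^2 + |q|^2 - 2 v.q and the nearest neighbor maximizes v.q.
dists = np.linalg.norm(augment_data(V) - augment_query(q), axis=1)
nn_index = int(np.argmin(dists))
mips_index = int(np.argmax(V @ q))
```

In practice the augmented vectors would be handed to an approximate nearest neighbor index rather than scanned exhaustively.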

NCE: NCE was introduced by [8] as an objective function that can be computed and optimized more
efficiently than the likelihood objective in cases where normalizing a distribution is an expensive
operation. In the same paper they proved that the NCE objective has a unique maximum, and that
it achieves that maximum for the same parameter values that maximize the true likelihood function.
Moreover, the normalization constant itself can be estimated as an outcome of the optimization. The
NCE objective relies on at least a single sample being available from the true distribution, which may
be unnormalized, and from a noise distribution, which should be normalized.

Kernel Feature Maps: The function $\exp(v_i \cdot q)$ is a kernel that depends only on the dot product of $v_i$ and
$q$; therefore, $\exp(v_i \cdot q)$ is a dot product kernel [?]. Every kernel that satisfies certain conditions (footnote 3)
also has an associated feature map such that the kernel can be decomposed as a countable sum of
products of feature functions [23]:

\[ \exists\, \lambda_j, \phi_j : \quad k(x, x') = \sum_{j \in \mathbb{N}} \lambda_j \phi_j(x) \phi_j(x') \]

If the values of $\lambda_j$ decrease fast enough then one could approximate the exp kernel up to some small
tolerance by a finite summation of its feature maps as follows:

\[ \exp(v_i \cdot q) \approx \sum_{j=1}^{P} \lambda_j \phi_j(v_i) \phi_j(q) \]

Log Normal Distribution of Z: If we assume that $q \sim \mathcal{N}(\mu, \Sigma)$ then $u_i$ is also a normal random
variable with distribution $\mathcal{N}(v_i^{\top}\mu,\, v_i^{\top}\Sigma v_i)$ and $\exp(u_i)$ is log-normally distributed. In this case, $Z(q)$
is the sum of $N$ dependent log-normal random variables. There is no analytical formula known for
the distribution of $Z$; however, it is known that in general the distribution of $Z$ is governed by the
distribution of $\max_i(u_i)$ when $Z$ is high enough, due to a result by [?] (footnote 4). This suggests that one
could reasonably estimate $Z$ when its value is high enough by exponentiating and then summing
only the top few $u_i$. Unfortunately, when the value of $Z$ is not very large, all $u_i$ become
significant for calculating $Z$. E.g., consider the pathological case that $\lVert q \rVert = 0$, which means that
$u_i = 0 \;\forall i \in [1, \ldots, N]$. In practice, for most values of $q$, the value of $Z(q)$ is not large enough to
ignore the contributions due to the tail of the $u_i$.

3 It is sufficient for a kernel to be analytic and to have positive coefficients in its Taylor expansion around zero.
These conditions are satisfied by the exp function.

4 They show that $\lim_{x \to \infty} P(Z > x) / \bar{F}(x)$ equals the number of $u_i$ that have the highest variance and highest mean.
Here $\bar{F}(x)$ is the tail of a log-normal CDF.

4 Methods

4.1 MIMPS: MIPS Based Importance Sampling

In Section 2 we discussed the importance sampling based approach for estimating the partition
function and pointed out that the work so far relies on the presence of an external model or
proposal distribution that can produce samples from the high probability region. However, by utilizing
algorithms for solving the MIPS problem, we can overcome the problem of engineering proposal
distributions, since we can retrieve the set $S_k(q)$ (see Section 3). A naive estimator, which we call
Naive MIMPS or NMIMPS, that utilizes $S_k(q)$ is the following:

\[ \hat{Z}_{\text{NMIMPS}} = \sum_{s \in S_k(q)} \exp(s \cdot q) \tag{4} \]

Unfortunately NMIMPS requires $k$ to be very high and is not realistic. Let $U_l$ represent a set of $l$
vectors sampled uniformly from amongst the vectors that are not in $S_k(q)$; then a better way of estimating
$Z$ is:

\[ \hat{Z}_{\text{MIMPS}} = \sum_{s \in S_k(q)} \exp(s \cdot q) + \frac{N-k}{l} \sum_{u \in U_l} \exp(u \cdot q) \tag{5} \]


In effect we are assuming that the values at the tail end of the probability distribution lie in a small 
range and thus a small sample size still has a small variance. A better estimator could be created by 
modeling the tail of the probability distribution, perhaps as a power law curve. 
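The two estimators in this subsection can be sketched in a few lines. The sketch below assumes an exact (oracle) retrieval of the top-$k$ set via a full scan, which is of course not sublinear; in a real system that step would be replaced by a MIPS index. The function names and the choice to sample the tail without replacement are assumptions made for illustration.

```python
import numpy as np

def top_k_oracle(V, q, k):
    """Indices of the k rows of V with the largest inner product with q."""
    if k == 0:
        return np.empty(0, dtype=int)
    return np.argpartition(-(V @ q), k - 1)[:k]

def nmimps_estimate(V, q, k):
    """Naive MIMPS: sum exp(s . q) over the retrieved head only."""
    head = top_k_oracle(V, q, k)
    return float(np.exp(V[head] @ q).sum())

def mimps_estimate(V, q, k, l, rng):
    """MIMPS: exact head sum plus a rescaled uniform sample of the tail."""
    N = V.shape[0]
    head = top_k_oracle(V, q, k)
    head_sum = np.exp(V[head] @ q).sum()
    rest = np.setdiff1d(np.arange(N), head)          # vectors not in S_k(q)
    tail = rng.choice(rest, size=l, replace=False)   # the sample U_l
    tail_sum = np.exp(V[tail] @ q).sum()
    return float(head_sum + (N - k) / l * tail_sum)

rng = np.random.default_rng(0)
V = rng.normal(size=(2000, 16)) * 0.5   # illustrative random weights
q = rng.normal(size=16)
Z_true = float(np.exp(V @ q).sum())
Z_mimps = mimps_estimate(V, q, k=100, l=100, rng=rng)
Z_naive = nmimps_estimate(V, q, k=100)
```

Setting `k=0` recovers the plain uniform-sampling estimator, and setting `l = N - k` makes the tail term exact, which gives a simple sanity check on an implementation.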


4.2 MINCE: MIPS Based NCE 


NCE is a general parameter estimation technique that can be used anywhere in place of maximum
likelihood estimation. Specifically, if we consider the value of the partition function to be a
parameter of the unnormalized distribution over $i$ induced by $q$, then by generating samples from the true
distribution and from a noise distribution we can, ideally, estimate it. Since NCE requires samples from
the true distribution, which we can generate by querying for $S_k(q)$, methods that perform
MIPS can be used for estimating the value of $Z(q)$ as well. If our noise distribution is uniform over
the $N - k$ vectors not present in $S_k(q)$, then the NCE estimate is $\hat{Z}_{\text{MINCE}} = \arg\max_Z J(Z)$, where:

\[ J(Z) = \sum_{s \in S_k(q)} \log\left( \frac{\exp(s \cdot q)/Z}{\exp(s \cdot q)/Z + \frac{l}{k}\frac{1}{N-k}} \right) + \sum_{u \in U_l} \log\left( \frac{\frac{l}{k}\frac{1}{N-k}}{\exp(u \cdot q)/Z + \frac{l}{k}\frac{1}{N-k}} \right) \tag{6} \]


It is worth noting that if we let $a_s = \exp(s \cdot q)\,k(N-k)/l$ and analogously define $b_u$, then the
objective simplifies into the very convenient form shown in (7), of which even the third derivative can
be computed efficiently. Efficient computation of the third derivative, used through Halley's method,
leads to a considerable speedup during optimization compared to using only the second derivative
with Newton's method. In (7), $i$ indexes the $k$ elements of $S_k(q)$ and $j$ indexes the $l$ elements of $U_l$.

\[ -J(Z) = \sum_{i=1}^{k} \log\left(\frac{Z}{a_i} + 1\right) + \sum_{j=1}^{l} \log\left(\frac{b_j}{Z} + 1\right) \tag{7} \]
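The simplification of the NCE objective can be checked term by term. The derivation below is a sketch that writes $c = \frac{l}{k(N-k)}$ for the scaled noise probability; the shorthand $c$ is introduced here only for this derivation and does not appear in the paper.

```latex
\begin{align*}
\log\frac{e^{s\cdot q}/Z}{e^{s\cdot q}/Z + c}
  &= -\log\Bigl(1 + \frac{cZ}{e^{s\cdot q}}\Bigr)
   = -\log\Bigl(\frac{Z}{a_s} + 1\Bigr),
  & a_s &= \frac{e^{s\cdot q}}{c} = \frac{e^{s\cdot q}\,k(N-k)}{l},\\[4pt]
\log\frac{c}{e^{u\cdot q}/Z + c}
  &= -\log\Bigl(\frac{e^{u\cdot q}}{cZ} + 1\Bigr)
   = -\log\Bigl(\frac{b_u}{Z} + 1\Bigr),
  & b_u &= \frac{e^{u\cdot q}}{c} = \frac{e^{u\cdot q}\,k(N-k)}{l}.
\end{align*}
% Summing the first line over S_k(q) and the second over U_l gives the
% simplified negative objective, whose stationarity condition is
\begin{equation*}
\frac{d}{dZ}\bigl(-J(Z)\bigr)
  = \sum_{i=1}^{k}\frac{1}{Z + a_i} \;-\; \sum_{j=1}^{l}\frac{b_j}{Z\,(Z + b_j)} = 0 .
\end{equation*}
```

Each summand is a rational function of $Z$, which is why higher derivatives of the objective are cheap to evaluate for Newton-type or Halley-type root finding.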

We also briefly note that one way of estimating the partition function could be to assume a
parametric form for the output distribution and then to use maximum likelihood estimation (MLE), which is the
most efficient estimator possible when the form of the distribution is known. However, even though
the individual exponentiated class scores $\exp(u_i)$ follow the log-normal distribution, it is not clear how one could use MLE
for computing the partition function in our setting.

4.3 FMBE: Feature Map Based Estimation 

In Section 3 we sketched how kernels could be linearized into a sum over products of feature maps.
This decomposition of a kernel can be utilized for speeding up the computation of $Z$ as follows:

\[ \sum_{i=1}^{N} \exp(v_i \cdot q) \approx \sum_{i=1}^{N} \sum_{j=1}^{P} \lambda_j \phi_j(v_i) \phi_j(q) = \sum_{j=1}^{P} \lambda_j \left( \sum_{i=1}^{N} \phi_j(v_i) \right) \phi_j(q) \]

\[ \text{Let } \tilde{\lambda}_j = \lambda_j \sum_{i=1}^{N} \phi_j(v_i); \text{ then } Z(q) \approx \sum_{j=1}^{P} \tilde{\lambda}_j \phi_j(q) \tag{8} \]

Essentially, one could precompute $\tilde{\lambda}_j$ during training and reduce the $O(N)$ summation to $O(P)$,
along with a constant factor, say $\rho$, needed to compute each $\phi_j$. Overall this scheme would lead to
savings in time if $P\rho < Nd$. Note that computing $\tilde{\lambda}_j$
involves the mapping $\phi$, which in general is unknown; let us now detail how we would compute
$\phi$. Although [23] gave explicit
formulas for deriving the eigenvalues $\lambda_j$ and eigenfunctions $\phi_j$ in terms of spherical harmonics,
unfortunately we are not aware of any method for efficiently computing the spherical harmonics in
high dimensions. Instead we rely on a technique developed by [11] for creating a randomized
kernel feature map for approximating the dot product kernel as follows:

\[ \text{Let } \phi_j(x) = \sqrt{a_M\, p^{M+1}} \prod_{r=1}^{M} \omega_r^{\top} x \tag{9} \]

\[ \text{Then } \exp(x \cdot y) \approx \frac{1}{P} \sum_{j=1}^{P} \phi_j(x) \phi_j(y) \tag{10} \]

Here $p$ is a hyper-parameter, usually taken to be 2; $a_m = 1/m!$ is the $m$th coefficient in the Taylor
expansion of exp; $M$ is chosen by drawing a sample from a geometric distribution $\Pr[M = m] = 1/p^{m+1}$;
and $\omega_r$ is a binary random vector each coordinate of which is chosen from $\{-1, 1\}$ with equal
chance. Refer to [11] for details (footnote 5).
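The feature map based estimator can be sketched as follows. This is a minimal illustration that fixes the hyper-parameter $p = 2$, draws an independent order $M$ and independent $\pm 1$ vectors for every feature, and averages the $P$ feature products (the $1/P$ normalization is an assumption of this sketch). All function names are hypothetical.

```python
import math
import numpy as np

def sample_feature_map(d, P, rng):
    """Draw P random features; feature j has order M_j, Pr[M = m] = 2^-(m+1)."""
    orders = rng.geometric(0.5, size=P) - 1           # support {0, 1, 2, ...}
    # One (M_j x d) matrix of +/-1 entries per feature.
    omegas = [rng.integers(0, 2, size=(int(m), d)) * 2.0 - 1.0 for m in orders]
    # sqrt(a_M * p^(M+1)) with a_M = 1/M! and p = 2.
    scales = np.array([math.sqrt(2.0 ** (int(m) + 1) / math.factorial(int(m)))
                       for m in orders])
    return omegas, scales

def apply_feature_map(fmap, X):
    """Evaluate all P features on each row of X; returns shape (n, P)."""
    omegas, scales = fmap
    X = np.atleast_2d(X)
    feats = np.empty((X.shape[0], len(omegas)))
    for j, (W, c) in enumerate(zip(omegas, scales)):
        # Product over r of (omega_r . x); an empty product (M = 0) equals 1.
        feats[:, j] = c * np.prod(X @ W.T, axis=1)
    return feats

def fmbe_precompute(fmap, V):
    """Sum each feature over all class vectors once, ahead of query time."""
    P = len(fmap[0])
    return apply_feature_map(fmap, V).sum(axis=0) / P

def fmbe_estimate(fmap, lam_tilde, q):
    """Estimate Z(q) with O(P) feature evaluations instead of O(N) scores."""
    return float(apply_feature_map(fmap, q)[0] @ lam_tilde)
```

A typical use is `fmap = sample_feature_map(d, P, rng)`, then `lam = fmbe_precompute(fmap, V)` once, then `fmbe_estimate(fmap, lam, q)` per query. The variance of each feature product grows quickly with the norms of the vectors, so a small-norm toy problem can look accurate while unnormalized real embeddings may need a very large $P$.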


5 Experiments 

We want to answer the following questions: (1) As a function of k, if we have access to a system 
that can retrieve Sk then what accuracy can be achieved by the proposed algorithms? (2) For a 
given k, how does the accuracy then change in the face of error in retrieval (such as would result 
from the use of an approximate nearest neighbor routine such as MIPS)? (3) What is the accuracy 
of our proposed methods as opposed to existing methods for estimating the partition function? 


5.1 Oracle Experiments 


In Section 3, in the discussion of the log-normal distribution of $Z$, we explained how the number of
neighbors needed for estimating $Z$ depends on the value of $Z$ itself (footnote 7). Our first set of
experiments relies on a real-world, publicly available collection of vectors: the neural word embeddings
dataset released by [?], which consists of 3 million 300-dimensional vectors, each representing a
distinct word or phrase, trained on a monolingual corpus of news text with 100 billion tokens (footnote 8).
More pertinently, the dot product between the vectors $v_i, v_j$
associated with the vocabulary items $w_i, w_j$ respectively represents the unnormalized log probability
of observing $w_i$ given $w_j$:

\[ p(w_i \mid w_j) = \frac{\exp(v_i \cdot v_j)}{\sum_{k=1}^{N} \exp(v_k \cdot v_j)} \tag{11} \]


For our experiments we used the first 100,000 vectors from the 3 Million word vectors and all 
experiments in this subsection are on this set. Note that we do not normalize the vectors in any way: 
this ensures that we stay true to real-world situation in which vectors are the weights of a trained 
neural network, and consequently we can not modify them. 


In Figure 1 we show CDFs over words given context, sorted such that the words contributing the
highest probability appear to the left. We can see that fewer than 1000 nearest neighbors (in terms
of largest dot product) are needed for recovering 80% of the true value of the partition
function for the rare words Chipotle and Kobe_Bryant, but close to 80K neighbors are needed for
common words that have a high frequency of occurrence in a monolingual corpus. This is explained
by the fact that common terms such as “The” occur in a wide variety of contexts and therefore induce
a somewhat flat probability distribution over words. These patterns indicate that the Naive MIMPS
estimator would need an unreasonably large number of nearest neighbors for correctly estimating
the partition function of common words, and therefore we do not experiment with it further and focus
on MIMPS, MINCE and FMBE.
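The quantity read off the CDF curves, namely how many top neighbors are needed to recover a given fraction of $Z$, can be computed directly. The sketch below assumes the vocabulary vectors are rows of an array `V` and uses a hypothetical helper name; the two toy inputs only contrast a peaked distribution with a flat one.

```python
import numpy as np

def neighbors_needed(V, q, fraction=0.8):
    """Smallest k such that the top-k terms exp(v . q) cover `fraction` of Z."""
    contrib = np.sort(np.exp(V @ q))[::-1]       # largest contribution first
    cdf = np.cumsum(contrib) / contrib.sum()     # cumulative share of Z
    k = int(np.searchsorted(cdf, fraction)) + 1  # first index reaching the share
    return min(k, len(contrib))

# Peaked case: one dominant score, so a single neighbor already covers 80%.
V_peaked = np.array([[3.0], [1.0], [0.0]])
k_peaked = neighbors_needed(V_peaked, np.array([1.0]), fraction=0.8)

# Flat case: a zero query gives equal contributions, so k grows with N.
V_flat = np.ones((10, 1))
k_flat = neighbors_needed(V_flat, np.array([0.0]), fraction=0.75)
```

A flat distribution forces $k$ to scale with the vocabulary size, which is the behavior the text describes for common context words.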


We implemented MIMPS, MINCE and FMBE based on an oracle ability to recover $S_k$, to which we
then add errors in a deterministic fashion. The resultant estimates of $Z$ are then tabulated based on
their mean absolute relative error (footnote 9).

5 They also present one more algorithm for creating random feature maps that we do not discuss here.
6 We defined $S_k(q)$ to be the set of $k$ vectors that have the highest inner product with the vector $q$ in Section 3.
7 Also see Figure 1.
8 URL: code.google.com/p/word2vec
9 Percentage absolute relative error: $\mu = 100\,|\hat{Z} - Z| / Z$.

Figure 1: CDF over vocabulary items sorted in descending order from left to right according to their
individual contribution to the distribution. Every curve is associated with a distinct context word
marked in the legend. The bracketed numbers in the legend indicate the frequency of occurrence of
these words in an English Wikipedia corpus. We can see that high frequency, common words tend
to induce flat distributions.


Our query set consists of 10,000 items taken from across the top 100,000 vectors chosen initially.
Each query represents the context (features) that is best “classified” by one of the many categories.
In the case of word embeddings and language models, this would be some preceding word context
which would be extracted and used in measuring the surprisal of the next word in a sequence (footnote 10). We
simulate this context by taking the representation of a given item from the vocabulary (a query
vector) and randomly adding varied levels of noise with controlled relative norms. Every experimental
setting was run three times with different seeds to maintain a low standard error.
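The query perturbation can be sketched as below. The sketch assumes that “controlled relative norm” means the Gaussian noise vector is rescaled so that its norm is a fixed fraction of the norm of the original query; that reading, and the helper name, are assumptions rather than details given in the text.

```python
import numpy as np

def add_relative_noise(q, relative_norm, rng):
    """Return q plus Gaussian noise whose norm is relative_norm * |q|."""
    noise = rng.normal(size=q.shape)
    noise *= relative_norm * np.linalg.norm(q) / np.linalg.norm(noise)
    return q + noise

rng = np.random.default_rng(0)
q = rng.normal(size=300)                      # stand-in for a word vector
q_noisy = add_relative_noise(q, 0.10, rng)    # 10% noise level
ratio = np.linalg.norm(q_noisy - q) / np.linalg.norm(q)
```

Repeating the call with different seeds gives the independent runs over which a mean and standard error can be reported.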

Table 1 presents the hyper-parameter tuning results for the different algorithms (UNIFORM, MIMPS
and MINCE). We can see a symmetric behavior in the table for MIMPS, which is surprising, and we
can see that the uniform case (which we model as a special case of MIMPS where k = 0) performs
badly. It is good to see that at k = 1000 and l = 1000 the error in Z is quite low, but more exciting to
see that when k = 100 and l = 100 the error is only 7.1% with only 0.1% standard error. This
means that by retrieving only 0.1% of the original vocabulary one can reasonably estimate the value
of Z with low error. The MINCE and FMBE estimators do not fare well, although
the decrease in error of the MINCE algorithm as the number of noise samples is increased agrees
with intuition. The FMBE algorithm had μ = 100 at D = 10000 and μ = 83.8 at D = 50000.
The standard error in both cases was lower than 0.1. Clearly the FMBE algorithm would require a far
higher number of dimensions in the feature map created through random projections before giving
reasonable results, and it might be better to experiment with the newer methods for generating kernel
feature maps that come with better theoretical guarantees, e.g. by [?]. We defer this investigation
to future work.

Table 2 shows the results of adding noise to the query vectors. As we mentioned before, another
interesting experiment for us was to restrictively simulate the type of errors that these estimators might
encounter in a real setting, where the vector with the highest or second highest inner product might
not be made available to the estimators. We tabulated the performance of the estimators on these
types of errors in Table 3. It is disconcerting to see the huge increase in error when the most
important neighbor is absent from the retrieved set, and clearly the importance of neighbors decreases as
their rank increases. This indicates that one should use retrieval mechanisms that have a high chance
of retrieving the single best nearest neighbor in practice. This evidence is important when deciding
between different indexing schemes that solve the MIPS problem.

10 Surprisal being a function of the probability assigned by the model to an observation given context, and
that probability assignment requiring the computation of Z.

                   l = 1000              l = 100               l = 10
                   μ          σ          μ          σ          μ          σ
Uniform            101.8      3.1        117.3      10.4       97.3       10.5
MIMPS (k=1000)     0.8        0.0        2.7        0.0        8.2        0.1
MIMPS (k=100)      2.4        0.0        7.1        0.1        16.1       0.2
MIMPS (k=10)       8.1        0.1        17.1       0.3        21A        0.7
MIMPS (k=1)        28.7       0.6        39.3       2.1        47.0       2.7
MINCE (k=1000)     96285.4    2124.1     12413.0    363.9      2527.3     72.9
MINCE (k=100)      3780.9     125.8      667.4      20.2       846.5      5.1
MINCE (k=10)       230.9      7.9        330.3      2.1        827.1      5.0
MINCE (k=1)        133.7      0.8        317.3      2.0        525.2      3.5

Table 1: Mean absolute relative error, μ, and the associated standard error, σ, for different
algorithms at varying settings of the hyper-parameters k and l that govern the number of vectors
retrieved.



           noise = 0%         noise = 10%        noise = 20%        noise = 30%
           μ        σ         μ        σ         μ        σ         μ        σ
Uniform    101.8    3.1       103.6    3.1       104.1    3.1       105.0    3.1
MIMPS      0.8      0.0       0.9      0.0       0.9      0.0       0.9      0.0
MINCE      230.9    7.9       229.9    7.9       233.7    8.0       231.5    8.4
FMBE       83.8     0.2       85.2     0.2       85.8     0.2       87.1     0.2

Table 2: Results at varying levels of Gaussian noise added to the query vectors to make them deviate
from the actual vectors. The header of each column indicates the norm of the noisy vector relative to the norm
of the original vector. k and l were both set to 1000 for MIMPS and to 1 and 1000 for MINCE.



           ret err=None       ret err=1          ret err=2          ret err=[1,2]
           μ        σ         μ        σ         μ        σ         μ        σ
MIMPS      0.8      0.0       39.3     0.2       6.1      0.0       45.0     0.2
MINCE      133.7    0.8       133.7    0.8       133.7    0.8       133.7    0.8

Table 3: The performance of the estimators with simulated retrieval errors in the oracle system.
“ret” stands for retrieval: “ret err=None” represents no error, “ret err=1” represents that the
closest vector in terms of inner product was missing from the $S_k$ retrieved by the oracle, and
“ret err=[1,2]” means that the first and second items were missing. We can see that the error increases
as more and more items go missing. k and l were both set to 1000 for MIMPS and to 1 and 1000
for MINCE.


5.2 Language Modeling Experiments 

We now move beyond controlled experiments and do an end-to-end experiment by training a
log-bilinear language model [13] on text data from sections 0-20 of the Penn Treebank corpus. At
test time we estimate the value of the partition function for the contexts in sections 21-22 of the
Penn Treebank and compare the approximation to the true values of the partition function. We train
the log-bilinear language models using NCE and clamp the value of the partition function to
one while training the language model, which enables us to evaluate the accuracy of our method
against the most common usage of NCE for language modeling. For the following experiments we
use MIMPS, implemented using the specific MIPS algorithm presented by [?], which
in turn is implemented by modifying the implementation of the K-Means Tree in FLANN [16].

Remember that our goal is to estimate the true value of the partition function in the test corpus. There
are two main hyper-parameters in our approach: the number of “head” samples and the number of
“tail” samples. We train the LBL language model with a dimensionality of 300 and a context size of
9, and tabulate the results as the number of head and tail samples is varied in Table 4. We can see that
with around 100 head samples and 100 tail samples the estimation accuracy becomes better than the
heuristic of assuming that the value of Z is 1.



           l = 10                                       l = 100
           AbsE-MIPS  AbsE-NCE  %Better  Speedup        AbsE-MIPS  AbsE-NCE  %Better  Speedup
k = 10     1063.5     352       34       18.5           728.5      352       47.5     13.5
k = 50     989.5      352       46.5     16             554        352       61.5     13
k = 100    229        352       55.5     14.5           198.5      352       70.5     10

Table 4: The AbsE columns contain the total absolute difference between the estimated value of the
partition function and the true value over the test set (sections 21-22 of the Penn Treebank corpus)
for the corresponding estimators. The test set contained close to 10,000 contexts. %Better refers
to the number of times the MIPS estimator gives a better estimate than the NCE heuristic, as a
percentage of the total number of contexts in the test set. Speedup refers to the speedup achieved
over brute-force computation by the corresponding MIPS method.


6 Conclusions 

We presented three new methods to estimate the partition function of a neural network or a
log-linear model using recent randomized algorithms for nearest neighbor
search, new statistical estimators and randomized kernel feature maps. We found that it is possible
to closely estimate the value of the partition function with a small number of samples, both under ideal
conditions where we have an oracle for retrieving the true set of k vectors closest to a query vector, and
on at least one real dataset using the algorithm for MIPS described in [?], implemented using the
FLANN toolkit. We also noted that the MIMPS estimator seems to be the most reasonable way to
do so. Initially we were hopeful that the MINCE estimator could also be used successfully, but we
found that it did not work as well.

While the data used for our experiments was always a real-world dataset, we performed both
controlled experiments, where settings such as retrieval error and the creation of query vectors were
carefully controlled to tease apart the sources of error, and end-to-end tasks. Based on the controlled
experiments we can see that the performance of the algorithms critically depends on the indexing
mechanism employed, and it might be possible to extend some of the guarantees of those algorithms
to our problem by using the results described in [9].

We also note that while a theoretical analysis of the performance of an estimator of the partition
function would be extremely desirable, doing so for methods that rely on LSH would need a
three-step analysis: (1) Analyze how the actual data (text or images) affects the weights learnt on
the outer layer of a neural network. This process is not well understood. (2) Analyze how that distribution
of weights would affect the performance of nearest neighbor retrieval. Perhaps the approach taken
in [?] could be extended for this purpose. (3) Finally, analyze how the error in nearest neighbor retrieval
would affect the accuracy of the estimator. This analysis could be done by assuming some parametric
distribution on the scores assigned to the output classes. We defer solutions to
one or more of these steps to future work.